Computation and Language
♻ ★ Are Models Trained on Indian Legal Data Fair?
Sahil Girhepuje, Anmol Goel, Gokul S Krishnan, Shreya Goyal, Satyendra Pandey, Ponnurangam Kumaraguru, Balaraman Ravindran
Recent advances and applications of language technology and artificial
intelligence have enabled much success across multiple domains, such as
law, medicine, and mental health. AI-based language models, such as
judgement-prediction systems, have recently been proposed for the legal
sector. However, these models are rife with encoded social biases picked
up from the training data. While bias
and fairness have been studied across NLP, most studies primarily locate
themselves within a Western context. In this work, we present an initial
investigation of fairness from the Indian perspective in the legal domain. We
highlight the propagation of learnt algorithmic biases in the bail prediction
task for models trained on Hindi legal documents. We evaluate the fairness gap
using demographic parity and show that a decision tree model trained for the
bail prediction task has an overall fairness disparity of 0.237 between input
features associated with Hindus and Muslims. Additionally, we highlight the
need for further research on fairness and bias in applying AI to the legal
sector, with a specific focus on the Indian context.
comment: Presented at the Symposium on AI and Law (SAIL) 2023
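The fairness gap the abstract reports is measured with demographic parity, i.e.
the difference in positive-prediction (here, bail-granted) rates between two
demographic groups. A minimal sketch of that computation is below; the
function name and the toy predictions are illustrative, not the paper's data
or code, and the paper's reported value of 0.237 is not reproduced here.

```python
def demographic_parity_gap(preds, groups, group_a, group_b):
    """Absolute difference in positive-prediction rates between two groups.

    preds:  iterable of 0/1 model predictions (1 = positive outcome)
    groups: iterable of group labels, aligned with preds
    """
    def positive_rate(g):
        members = [p for p, grp in zip(preds, groups) if grp == g]
        return sum(members) / len(members)

    return abs(positive_rate(group_a) - positive_rate(group_b))


# Hypothetical toy example: group A gets a positive prediction 2/3 of the
# time, group B only 1/3 of the time -> disparity of ~0.333.
gap = demographic_parity_gap(
    preds=[1, 1, 0, 1, 0, 0],
    groups=["A", "A", "A", "B", "B", "B"],
    group_a="A",
    group_b="B",
)
```

A gap of 0 would mean both groups receive favourable predictions at the same
rate; larger values indicate the kind of disparity the paper quantifies.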
♻ ★ Large Language Models in the Workplace: A Case Study on Prompt Engineering for Job Type Classification
This case study investigates the task of job classification in a real-world
setting, where the goal is to determine whether an English-language job posting
is appropriate for a graduate or entry-level position. We explore multiple
approaches to text classification, from supervised methods such as Support
Vector Machines (SVMs) to state-of-the-art deep learning models such as
DeBERTa. We compare them with Large Language
Models (LLMs) used in both few-shot and zero-shot classification settings. To
accomplish this task, we employ prompt engineering, a technique that involves
designing prompts to guide the LLMs towards the desired output. Specifically,
we evaluate the performance of two commercially available state-of-the-art
GPT-3.5-based language models, text-davinci-003 and gpt-3.5-turbo. We also
conduct a detailed analysis of the impact of different aspects of prompt
engineering on the model's performance. Our results show that, with a
well-designed prompt, a zero-shot gpt-3.5-turbo classifier outperforms all
other models, achieving a 6% increase in Precision@95% Recall compared to the
best supervised approach. Furthermore, we observe that the wording of the
prompt is a critical factor in eliciting the appropriate "reasoning" in the
model, and that seemingly minor prompt variations can significantly affect
performance.
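The headline metric, Precision@95% Recall, is the precision obtained at the
most selective score threshold that still recovers at least 95% of the true
positives. A small sketch of how such a metric is typically computed is below;
the function name and the toy scores are illustrative assumptions, not the
study's evaluation code.

```python
def precision_at_recall(scores, labels, target_recall=0.95):
    """Precision at the tightest threshold whose recall meets target_recall.

    scores: iterable of classifier confidence scores (higher = more positive)
    labels: iterable of 0/1 ground-truth labels, aligned with scores
    """
    # Rank examples from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_positives = sum(labels)
    true_positives = 0
    for k, i in enumerate(order, start=1):
        true_positives += labels[i]
        if true_positives / total_positives >= target_recall:
            # First cutoff that reaches the recall target.
            return true_positives / k
    return true_positives / len(order)


# Hypothetical toy example: recall >= 0.95 is first reached after taking the
# top 4 predictions, where 3 of 4 are correct -> precision 0.75.
p = precision_at_recall(
    scores=[0.9, 0.8, 0.7, 0.6],
    labels=[1, 1, 0, 1],
)
```

Fixing recall high (95%) and comparing precision is a natural choice for job
filtering, where missing suitable entry-level postings is costlier than
letting a few unsuitable ones through for later review.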